// Load the NLS Young Women panel dataset (nlswork)
webuse nlswork
// Summarize the dataset
summarize
// tabulate college and union status
tabulate collgrad union
// generate a variable for log hours worked
generate log_hours = log(hours)
// plot log hours against year separately for union and non-union
graph twoway scatter log_hours year if union == 0
graph twoway scatter log_hours year if union == 1
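// Each graph command above redraws the default Graph window, so only the last plot stays visible.
// A possible alternative, assuming side-by-side panels are acceptable: by(union) draws one panel
// per union value in a single figure (observations with missing union are left out by default).
graph twoway scatter log_hours year, by(union)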
// Regress hours worked against union status
regress hours union
// What is the identification assumption for the coefficient on union to be causal?
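// Union status must be exogenous: uncorrelated with the unobserved determinants of hours worked.
// That rules out omitted variables that drive both union status and hours, and reverse causality from hours to union status.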
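// Regress hours worked against union status with individual (idcode) and year fixed effects
// reghdfe is a community-contributed command; if it is not installed, run: ssc install ftools, then ssc install reghdfe
reghdfe hours union, absorb(idcode year)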
// What is the identification assumption for the coefficient on union to be causal?
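// Conditional on the individual and year fixed effects, union status must be exogenous.
// The fixed effects absorb time-invariant individual traits and common year shocks, so the remaining
// assumption is that no time-varying unobservable is correlated with both changes in union status and hours.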
// Now cluster standard errors. What level should you cluster at and why?
// What do you notice about the coefficients?
reghdfe hours union, absorb(idcode year) cluster(year)
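// Clustering leaves the coefficients unchanged; only the standard errors change.
// The standard error seems to decrease when clustering by year, but year provides only a small number
// of clusters, so that estimate is unreliable. The individual (idcode) is the more natural level:
// a person's errors are serially correlated over time and union status varies within person.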
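// Alternative sketch, assuming shocks to hours are correlated within a person across years:
// cluster on idcode, the panel unit, and compare the standard error on union with the run above.
reghdfe hours union, absorb(idcode year) cluster(idcode)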
// Now add fixed effects for occupation code
// In terms of causal inference, why is adding occupation code
// as a control probably not a good idea?
reghdfe hours union, absorb(idcode year occ_code)
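// Occupation may itself be determined by union status, which makes it a post-treatment ("bad") control.
// If union affects hours through occupation, conditioning on it removes part of the effect (a mediator).
// If occupation also depends on unobserved determinants of hours, conditioning on it opens a collider path and biases the estimate.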
// What are "good controls" in the dataset in terms of being exogenous to union status?
// add them to the regression, interacted with year, using this syntax:
// absorb( ... year##c.(var1 var2 ...))
// how do these change your estimates?
reghdfe hours union, absorb(idcode year##c.(age race c_city))
// some of the variable names are not clear...
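// One way to clarify the names, assuming the variable labels that ship with nlswork are informative:
// describe prints each variable's storage type and label.
describe age race c_city
// codebook additionally lists the coded values of a categorical variable such as race.
codebook race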
Find a news article mistaking correlation for causation. Link to the article and write a short paragraph explaining the mistake.
https://tinyurl.com/y6jtqf9x
Claims of a link between vitamin D deficiency and COVID-19 morbidity and infection risk have been circulating widely since early September.
While recent research suggests that there is likely a causal relationship, I selected an early article that reports statistics on COVID testing outcomes in a misleading way, as if they were caused by vitamin D deficiency.
The article quotes: "vitamin D deficiency increases a person's risk for catching COVID-19 by a whopping 77% compared to those who have sufficient levels of the nutrient".
It should be noted that the article failed to consider:
* People who are susceptible to vitamin D deficiency are disproportionately Black, elderly, or living with underlying conditions.
These are the same groups that have higher rates of infection and death because of socioeconomic and preexisting health factors, so those factors confound the comparison.
* The original study used historical vitamin D test results (from up to a year prior). These may not accurately reflect the patient's vitamin D level at the time of infection.
Additionally, vitamin D testing is a specialized test rather than a routine one, so the patients with a result on record are a selected group, which raises questions about selection bias in the study.
In summary, while a relationship between vitamin D levels and COVID infection outcomes may exist, the article misleadingly presents the 77% difference in risk as an effect caused by vitamin D deficiency.